Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Author Information
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee
Microsoft Research India and Georgia Institute of Technology
Link: [2403.02310] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Abstract
Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt【prefill: high latency, high hardware utilization】. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request【decode: low latency, low hardware utilization】. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at this https URL.
One-Sentence Summary
The paper is mainly about chunked prefill and a mechanism for running prefill chunks concurrently with ongoing decodes (stall-free batching).
Motivation
Prefill vs. decode throughput:
- Prefill is more efficient (achieves higher per-token throughput) than decode.
- Prefill throughput changes little with batch size, whereas decode throughput grows substantially with batch size (the roofline sketch under the bound analysis below shows why).
Per-phase time breakdown:
The MLP (linear) layers account for most of the compute time.
Bound analysis:
Prefill is compute-bound while decode is memory-bound; the system performs best when both compute and communication/bandwidth utilization are kept as high as possible.
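A back-of-the-envelope roofline check makes this concrete and also explains the batch-size behavior noted above. This is only a sketch with assumed A100-like peak numbers and an assumed 8192-wide linear layer; the constants are illustrative, not measurements from the paper.

```python
# Roofline sketch: arithmetic intensity (FLOPs per byte of weight traffic) of a
# linear layer during prefill vs. decode. All constants are assumptions
# (A100-like fp16 peak, HBM bandwidth, hidden size), not figures from the paper.

PEAK_FLOPS = 312e12               # assumed fp16 tensor-core peak, FLOP/s
PEAK_BW    = 2.0e12               # assumed HBM bandwidth, bytes/s
RIDGE      = PEAK_FLOPS / PEAK_BW # ~156 FLOP/byte needed to become compute-bound

def intensity(num_tokens, d_in=8192, d_out=8192, bytes_per_param=2):
    """FLOPs per byte for a [num_tokens, d_in] @ [d_in, d_out] matmul,
    counting only weight reads (which dominate at small token counts)."""
    flops = 2 * num_tokens * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_param
    return flops / weight_bytes   # simplifies to num_tokens

# Prefill: the whole prompt passes through at once -> far above the ridge.
print("prefill(1024):", intensity(1024), "compute-bound?", intensity(1024) >= RIDGE)

# Decode: one token per request, so intensity equals the batch size -- which is
# exactly why batching decodes is what recovers GPU utilization.
for batch in (1, 8, 64, 256):
    print("decode batch", batch, "compute-bound?", intensity(batch) >= RIDGE)
```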
Transition-point analysis
When the number of tokens is small, linear-layer execution time barely changes as the token count grows. Once the token count crosses a critical point, i.e., once the system becomes compute-bound, linear-layer execution time grows roughly linearly with the number of tokens.
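The same toy model makes the knee explicit: below it, the fixed weight-read time dominates; above it, the FLOP term takes over. The hardware peaks are again assumed values for illustration only.

```python
# Sketch: time of one linear layer modeled as max(compute time, weight-read time).
# Flat in the token count until ~PEAK_FLOPS/PEAK_BW tokens, then linear.
PEAK_FLOPS = 312e12   # assumed FLOP/s
PEAK_BW    = 2.0e12   # assumed bytes/s

def linear_layer_time_s(num_tokens, d_in=8192, d_out=8192, bytes_per_param=2):
    compute_s = 2 * num_tokens * d_in * d_out / PEAK_FLOPS
    memory_s  = d_in * d_out * bytes_per_param / PEAK_BW
    return max(compute_s, memory_s)

for t in (16, 64, 156, 512, 2048):
    print(f"{t:5d} tokens -> {linear_layer_time_s(t) * 1e6:7.1f} us")
# Stays ~67 us until roughly 156 tokens, then grows linearly with the token count.
```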
Pipeline-parallelism analysis
Pipeline parallelism introduces bubbles; keeping the pipeline stages balanced is the key problem.
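A toy two-stage pipeline simulation (hypothetical iteration times, not measured) illustrates how alternating long prefill iterations with short decode iterations inflates bubbles, and why uniform, budget-bounded batches shrink them:

```python
# Toy 2-stage pipeline: each iteration runs on stage 1, then stage 2, taking the
# same time on both. Stage 2 idles (a bubble) whenever its next input has not
# yet finished on stage 1. Iteration times are hypothetical, in ms.

def stage2_bubble_ms(iter_times_ms):
    s1_done = s2_free = bubble = 0.0
    for t in iter_times_ms:
        s1_done += t                   # stage 1 finishes this iteration
        start = max(s1_done, s2_free)  # stage 2 needs its input and to be free
        bubble += start - s2_free      # idle time on stage 2 (incl. pipeline fill)
        s2_free = start + t
    return bubble

mixed   = [2, 2, 2, 42, 2, 2, 2, 42]   # decode-only iterations + full prefills
uniform = [12] * 8                     # token-budget-bounded hybrid batches
print(stage2_bubble_ms(mixed), stage2_bubble_ms(uniform))   # 42.0 vs 12.0
```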
Innovations / Contributions
Detailed Design
The core of the design is chunked prefill: prompt processing is split into chunks that are co-scheduled with ongoing decodes.
Designing the token budget:
- Splitting a prefill into too many chunks increases GPU memory reads (the KV cache of earlier chunks has to be re-read by every later chunk). The authors note that even at fairly small chunk sizes the work remains compute-bound, so the rule is to make chunks as large as possible without violating the TBT SLO.
- GPU linear-layer execution time can jump abruptly at certain sizes (tile-quantization effects), so chunk sizes should respect these boundaries.
- If the token budget is too large it can reintroduce pipeline bubbles; if it is too small, arithmetic intensity drops and throughput is wasted.
They therefore use Vidur, their MLSys 2024 work, to determine the token budget for a given deployment (a sketch of budget-bounded stall-free batching follows below).
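A minimal sketch of the resulting iteration-level batching policy. It is simplified; the `Request` class, its field names, and the fixed 512-token budget are assumptions for illustration, whereas the real system picks the budget per deployment (e.g. with Vidur).

```python
# Minimal sketch of stall-free batching with chunked prefills (simplified).
# Each iteration: admit all ongoing decodes first (1 token each), then fill the
# remaining token budget with chunks of pending prefills, never pausing decodes.
from dataclasses import dataclass

TOKEN_BUDGET = 512  # assumed; chosen per deployment in the real system

@dataclass
class Request:
    prompt_len: int
    prefilled: int = 0              # prompt tokens already processed
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

def build_batch(requests):
    """Return (request, num_tokens) pairs for one hybrid iteration."""
    budget = TOKEN_BUDGET
    batch = []
    # 1) Decodes come first: one token per running request, so TBT never stalls.
    for r in requests:
        if r.in_decode() and budget > 0:
            batch.append((r, 1))
            budget -= 1
    # 2) Spend what is left on prefill chunks of waiting / partially-filled requests.
    for r in requests:
        if not r.in_decode() and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r, chunk))
            r.prefilled += chunk
            budget -= chunk
    return batch

# Example: two decoding requests coexist with a 4096-token prompt, which is
# streamed in ~510-token chunks instead of monopolizing whole iterations.
reqs = [Request(128, 128), Request(256, 256), Request(4096)]
print(build_batch(reqs))
```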
Evaluation
Background
Overview of problems in prior work
Challenges
Additional background
TP allows the batch size to be increased roughly linearly, but TP needs one all-reduce in the attention block and another in the MLP block of every layer, so its communication overhead is high; it is usually deployed over NVLink.
PP only sends activations once between consecutive stages, so its compute-to-communication ratio is much higher, which makes it generally more effective for cross-node deployments (a rough estimate follows below).
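A rough per-layer, per-token communication estimate supports this. The constants are illustrative assumptions (fp16 activations, hidden size 8192, a ring all-reduce that moves about 2x the tensor size per rank, ~20 layers per pipeline stage), not numbers from the paper.

```python
# Back-of-the-envelope communication volume per token per transformer layer.
# All constants are illustrative assumptions, not numbers from the paper.
HIDDEN = 8192            # assumed hidden size
BYTES  = 2               # fp16
RING_FACTOR = 2          # ring all-reduce sends ~2x the tensor size per rank
LAYERS_PER_STAGE = 20    # assumed layers mapped onto one pipeline stage

# TP: two all-reduces per layer (after attention and after the MLP).
tp_per_token_per_layer = 2 * RING_FACTOR * HIDDEN * BYTES

# PP: activations cross a stage boundary once, amortized over that stage's layers.
pp_per_token_per_layer = HIDDEN * BYTES / LAYERS_PER_STAGE

print(tp_per_token_per_layer, "B vs", pp_per_token_per_layer, "B")
# ~65 KB vs ~0.8 KB per token per layer: TP is communication-heavy and wants
# NVLink-class bandwidth, while PP's much smaller transfers tolerate
# cross-node interconnects.
```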